729 research outputs found

    A comparison of personal name matching: Techniques and practical issues

    No full text
    Finding and matching personal names is at the core of an increasing number of applications: from text and Web mining, information retrieval and extraction, search engines, to deduplication and data linkage systems. Variations and errors in names make exact string matching problematic, and approximate matching techniques based on phonetic encoding or pattern matching have to be applied. When compared to general text, however, personal names have different characteristics that need to be considered. ¶ In this paper we discuss the characteristics of personal names and present potential sources of variations and errors. We overview a comprehensive number of commonly used, as well as some recently developed name matching techniques. Experimental comparisons on four large name data sets indicate that there is no clear best technique. We provide a series of recommendations that will help researchers and practitioners to select a name matching technique suitable for a given data set

    Towards Parameter-free Blocking for Scalable Record Linkage

    Get PDF
    Linking or matching databases is becoming increasingly important in many data mining projects, as linked data can contain information that is not available otherwise, or that would be too expensive to collect. A main challenge when linking large databases is the complexity of the linkage process: potentially each record in one database has to be compared with all records in the other database. Various techniques, collectively know as `blocking', have been developed to deal with this quadratic complexity. Most of these techniques require several parameters to be set by the user in order to achieve good results. In this paper we evaluate six blocking techniques within a common framework with regard to the number and quality of the candidate record pairs generated. We propose a modification to two existing techniques that reduces the variance in the quality of the blocking results over a range of parameter values, enabling more robust, practical record linkage without the need of time consuming manual parameter tuning

    Noise-tolerant approximate blocking for dynamic real-time entity resolution

    Get PDF
    Entity resolution is the process of identifying records in one or multiple data sources that represent the same real-world entity. This process needs to deal with noisy data that contain for example wrong pronunciation or spelling errors. Many real world applications require rapid responses for entity queries on dynamic datasets. This brings challenges to existing approaches which are mainly aimed at the batch matching of records in static data. Locality sensitive hashing (LSH) is an approximate blocking approach that hashes objects within a certain distance into the same block with high probability. How to make approximate blocking approaches scalable to large datasets and effective for entity resolution in real-time remains an open question. Targeting this problem, we propose a noise-tolerant approximate blocking approach to index records based on their distance ranges using LSH and sorting trees within large sized hash blocks. Experiments conducted on both synthetic and real-world datasets show the effectiveness of the proposed approach

    Time-aware topic recommendation based on micro-blogs

    Get PDF
    Topic recommendation can help users deal with the information overload issue in micro-blogging communities. This paper proposes to use the implicit information network formed by the multiple relationships among users, topics and micro-blogs, and the temporal information of micro-blogs to find semantically and temporally relevant topics of each topic, and to profile users' time-drifting topic interests. The Content based, Nearest Neighborhood based and Matrix Factorization models are used to make personalized recommendations. The effectiveness of the proposed approaches is demonstrated in the experiments conducted on a real world dataset that collected from Twitter.com

    Context Aware Computing for The Internet of Things: A Survey

    Get PDF
    As we are moving towards the Internet of Things (IoT), the number of sensors deployed around the world is growing at a rapid pace. Market research has shown a significant growth of sensor deployments over the past decade and has predicted a significant increment of the growth rate in the future. These sensors continuously generate enormous amounts of data. However, in order to add value to raw sensor data we need to understand it. Collection, modelling, reasoning, and distribution of context in relation to sensor data plays critical role in this challenge. Context-aware computing has proven to be successful in understanding sensor data. In this paper, we survey context awareness from an IoT perspective. We present the necessary background by introducing the IoT paradigm and context-aware fundamentals at the beginning. Then we provide an in-depth analysis of context life cycle. We evaluate a subset of projects (50) which represent the majority of research and commercial solutions proposed in the field of context-aware computing conducted over the last decade (2001-2011) based on our own taxonomy. Finally, based on our evaluation, we highlight the lessons to be learnt from the past and some possible directions for future research. The survey addresses a broad range of techniques, methods, models, functionalities, systems, applications, and middleware solutions related to context awareness and IoT. Our goal is not only to analyse, compare and consolidate past research work but also to appreciate their findings and discuss their applicability towards the IoT.Comment: IEEE Communications Surveys & Tutorials Journal, 201

    Context-aware Dynamic Discovery and Configuration of 'Things' in Smart Environments

    Full text link
    The Internet of Things (IoT) is a dynamic global information network consisting of Internet-connected objects, such as RFIDs, sensors, actuators, as well as other instruments and smart appliances that are becoming an integral component of the future Internet. Currently, such Internet-connected objects or `things' outnumber both people and computers connected to the Internet and their population is expected to grow to 50 billion in the next 5 to 10 years. To be able to develop IoT applications, such `things' must become dynamically integrated into emerging information networks supported by architecturally scalable and economically feasible Internet service delivery models, such as cloud computing. Achieving such integration through discovery and configuration of `things' is a challenging task. Towards this end, we propose a Context-Aware Dynamic Discovery of {Things} (CADDOT) model. We have developed a tool SmartLink, that is capable of discovering sensors deployed in a particular location despite their heterogeneity. SmartLink helps to establish the direct communication between sensor hardware and cloud-based IoT middleware platforms. We address the challenge of heterogeneity using a plug in architecture. Our prototype tool is developed on an Android platform. Further, we employ the Global Sensor Network (GSN) as the IoT middleware for the proof of concept validation. The significance of the proposed solution is validated using a test-bed that comprises 52 Arduino-based Libelium sensors.Comment: Big Data and Internet of Things: A Roadmap for Smart Environments, Studies in Computational Intelligence book series, Springer Berlin Heidelberg, 201
    • …
    corecore